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Abstract 

Cancers arise from successive rounds of mutation and selection, generating clonal populations that vary in size, mutational 
content and drug responsiveness. Ascertaining the clonal composition of a tumor is therefore important both for prognosis 
and therapy. Mutation counts and frequencies resulting from next-generation sequencing (NGS) potentially reflect a tumor's 
clonal composition; however, deconvolving NGS data to infer a tumor's clonal structure presents a major challenge. We 
propose a generative model for NGS data derived from multiple subsections of a single tumor, and we describe an 
expectation-maximization procedure for estimating the clonal genotypes and relative frequencies using this model. We 
demonstrate, via simulation, the validity of the approach, and then use our algorithm to assess the clonal composition of a 
primary breast cancer and associated metastatic lymph node. After dividing the tumor into subsections, we perform exome 
sequencing for each subsection to assess mutational content, followed by deep sequencing to precisely count normal and 
variant alleles within each subsection. By quantifying the frequencies of 17 somatic variants, we demonstrate that our 
algorithm predicts clonal relationships that are both phylogenetically and spatially plausible. Applying this method to larger 
numbers of tumors should cast light on the clonal evolution of cancers in space and time. 
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Introduction 

Many clones exist within each cancer, and selective pressure 
imposed by environmental factors, most notably treatments 
directed at tumor eradication, favors the emergence of clones 
that grow increasingly resistant to successive rounds of tlierapy. 
Incorporating this intra-tumor heterogeneity into strategies for 
planning, monitoring, and revising cancer treatment could 
improve outcomes for oncologists and their patients. Therefore, 
methods for estimating the number, size and mutational content of 
clones within a patient's tumor are being explored. 

New approaches are being developed to assess the clonal 
content of a given tumor. Methods based on the interrogation of 
individual cells have rehed on the use of fluorescent markers [1,2] 
or single cell sequencing [3-6]. Whereas fluorescence-based 
approaches are inevitably limited by the relatively small number 
of features they can accommodate, single cell sequencing brings 
the highest possible resolution to characterizing an individual 
patient's tumor. Nonetheless, single cell sequencing also faces 
obstacles to its widespread implementation. Evaluating sufficiently 
large numbers of single cells to obtain statistical power can be 
prohibitive, for technical or financial reasons. Additionally, it is 
often difficult to ascertain the identity of the cells being sequenced, 
and details regarding the spatial positioning of cells relative to each 



other and to other cells in the tumor are lost when the single cells 
are obtained. These disadvantages pose significant challenges to 
the widespread adoption of single cell sequencing as a means for 
assessing tumor heterogeneity. 

Complementing single cell approaches are efforts to deconvolve 
clonal subpopulations based on the frequencies of mutated alleles 
within one or more bulk tumor specimens. Shah et al. [7], who 
sequenced a breast cancer at the time of diagnosis and nine years 
later, after metastasis, pointed out that allele frequencies of the 
mutations shared between the two samples could be used to 
segregate primary mutations into those that occur in a dominant 
versus subdominant clone. This insight is the basis for a variety of 
approaches that apply clustering algorithms to mutation allele 
frequencies, including kernel density estimation [8] and Dirichlet 
process modeling applied either to the allele frequencies [9] or to a 
combination of allele frequency, loss-of-heterozygosity status and 
copy number [10-13]. 

Clearly, statistical power to infer variants and, ultimately, clonal 
composition, is increased if multiple samples are available for 
analysis. Accordingly, various studies have examined the progres- 
sion of cancer within one or more patients over time. Sets of 
variants that exhibit similar allele frequencies within a single 
sample are suggestive of a clonal population. Hence, clustering 
methods to identify groups of mutations associated with a single 
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Author Summary 

Cancers arise from a series of mutations that occur over 
time. As a result, as a tumor grows each cell inherits a 
distinctive genotype, defined by the set of all somatic 
mutations that distinguish the tumor cell from normal 
cells. Acertaining these genotype patterns, and identifying 
which ones are associated with the growth of the cancer 
and its ability to metastasize, can potentially give clinicians 
insights into how to treat the cancer. In this work, we 
describe a method for inferring the predominant geno- 
types within a single tumor. The method requires that a 
tumor be sectioned and that each section be subjected to 
a high-throughput sequencing procedure. The resulting 
mutations and their associated frequencies within each 
tumor section are then used as input to a probabilistic 
model that infers the underlying genotypes and their 
relative frequencies within the tumor. We use simulated 
data to demonstrate the validity of the approach, and then 
we apply our algorithm to data from a primary breast 
cancer and associated metastatic lymph node. We dem- 
onstrate that our algorithm predicts genotypes that are 
consistent with an evolutionary model and with the 
physical topology of the tumor itself. Applying this 
method to larger numbers of tumors should cast light 
on the evolution of cancers in space and time. 

clone have been applied. For example, kernel density estimation 
has been applied to allele frequencies from tumor-relapse pairs 
from eight acute myeloid leukemia (AML) patients [14] and from 
seven secondary AML patients [15]. 

An orthogonal approach taken by Newberger et al. [16] 
employs triplet samples of neoplasia, matched normal and 
carcinoma from six patients to infer lineages of various genetic 
events. They characterize each locus in terms of a binary vector 
representing the presence of the mutation across the various 
samples and then group the loci into classes on the basis of these 
vectors. After filtering low frequency classes, the classes are used to 
manually construct a phylogenetic tree. The focus of the study is to 
identify the shared characteristics of the evolutionary process 
across six patients with breast cancer. 

In the current study, we adopt an alternative approach to 
identify clonal structure. Rather than measuring allele frequencies 
in multiple samples from the same patient over time, we physically 
subdivide a single breast cancer specimen and measure allele 
frequencies within each subsection (Figure 1). We are aware of two 
previous studies that have adopted such an approach. Yachida et 
al. [17] analyzed seven metastatic pancreatic cancers, sequencing 
from multiple samples per patient. Clones are initially defined 
relative to sample types (peritoneal, liver and lung metastases). 
Subsequently, the tumors from two patients are resected and a 
clonal phylogeny is inferred manually. More recently, Gerlinger et 
al. [18] carried out exome sequencing followed by targeted deep 
sequencing on samples from four patients with renal carcinoma. 
Each primary tumor was divided into 9 regions, and a phylogeny 
was manually constructed by assuming that higher alternate allele 
frequencies correspond to earlier mutations. In neither of these 
studies was an algorithm proposed to automatically infer from 
such data both the clonal genotypes and the relative frequencies of 
the clones within each subsection. 

The method proposed here bears some similarity to the recently 
proposed Tree Approach to Clonality (TrAp) method [19]. The 
TrAp algorithm aims to identily the number, relative frequencies 
and genotypes of clones within a tumor using a formalism 
somewhat similar to ours, based on matrix decomposition. 



However, rather than analyzing data from multiple sections, the 
authors use as input a single set of variant allele frequencies and 
then constrain the resulting optimization problem by introducing a 
series of four assumptions about cancer evolution. It is not clear 
whether the method can easily generalize to analysis of data from 
multiple sections or multiple time points. 

Here we describe a generative binomial model that incorporates 
information from multiple sections from a single tumor at a single 
time point to infer the frequencies and genotypes for a specified 
number of clones. An implementation of our algorithm is available 
through Bioconductor as an R package called Clomial (http:/ /www. 
bioconductor . org/ packages / release/ bloc /html/ Clomial. html) . We 
use Clomial version 1.1.7 to apply this approach to a breast cancer 
specimen and demonstrate that the results from our model predict 
relationships that are phylogeneticaUy and spatially plausible. 

Results 

1 Inferring the clonal architecture of a tumor 

We assume that a tumor is comprised of multiple populations of 
cells ("clones"), each with a unique genotype, and that these 
populations are heterogeneously distributed within the tumor 
itself We collect, from several physical subsections of the tumor, 
shotgun sequencing reads. We also collect sequencing data from a 
non-tumor subsection from the same patient. Using the called 
genotypes from the normal subsection, and restricting ourselves to 
positions that are homozygous in the normal subsection, each read 
from a tumor subsection exhibits either a normal allele or a variant 
allele at each location. We exclude positions that exhibit 
homozygous normal alleles in all of the tumor subsections. Our 
goal is to infer, from the remaining mutated positions, the 
genotype of each clonal population and their relative frequencies 
within each physical subsection of the tumor. 

Formally, the problem can be stated as follows. Note that we use 
bold face letters for random variables, and that Ai and A' 
respectively denote the row and the column of matrix ^. We 
are given two primaiy input matrices RnxM and Xf^ ^ m> where N 
is the number of mutated loci, M is the number of subsections (of 
which one is normal and M —I are tumor), Rjj is the total 
number of reads (i.e., the coverage) at locus / in subsection j, and 
Xjj is the number of cancerous reads (those supporting the 
mutation) at locus / in subsection j. We assume, without loss of 
generality, that the first of the M subsections corresponds to 
normal tissue, and that the remaining M—l subsections are from 
the tumor. In addition, we consider C, the number of distinct 
clones in the tumor, as a hyperparameter, and train a model based 
on a given value of C. We assume that the first clone corresponds 
to the normal cell population and the tumor is composed of C — 1 
tumor clones. Later, we will discuss whether C can be estimated 
from the data. Our task is to infer two matrices: a clone frequency 
matrix Pc x M in which P^j is the proportion of cells of clone c in 
subsection j, and a genotype matrix Zfj x c in which Z,- ^ = 1 if clone c 
has the variant allele at locus /, and Z,;,. = 0 otherwise. The first 
column of Z contains all zeroes because it represents the "normal 
clone." By definition, each column of P sums to 1. Also, by 
construction, the first column of X corresponds to the normal 
subsection and hence consists almost entirely of zeroes, although 
small non-zero counts may be possible due to contamination from 
tumor or due to sequencing error. If the first column of X 
consisted entirely of zeroes, then we would expect the first column 
of P to be of the form (1,0, .. . ,0), but in order to allow for the 
possibility that the allegedly normal subsection can have slight 
tumor contamination, we infer the first column of P (as well as the 
other M—l columns). 
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Figure 1. Inference of tumor clonal content. A collection of subsections of a tumor are subjected to next-generation sequencing to measure, 
across a common set of genomic loci, counts of two alleles — the normal allele that was observed In a matched normal sample at that locus, and a 
variant allele. The resulting counts matrices are provided as Input to an inference procedure that estimates the clonal genotypes and frequencies. 
doi:l 0.1 371/journal.pcbi.l 003703.g001 



We propose to solve this problem using a generative model 
whose parameters are learned via expectation-maximization (EM) 
[20]. Accordingly, we define a matrix of hidden variables ZjvxC 
representing the unknown genotypes of the clones; for instance, if 
1) then the c"' clone has a tumor allele at the f'^ locus. We 
assume that each Z,-,^ follows an independent BernouUi distribu- 
tion with parameter jU, ^, i.e., 

Zi,,~Bem(^j,.,). (1) 

We also assume that if a mutation is present in a particular 
clone, then at that locus the clone is heterozygous with copy 
number erjual to 1 . Therefore, for subsection j, if clone c has a 
mutation at locus i (Z, ^ = 1), then its contribution to the observed 

count of cancer alleles is by - Pcj, half of its proportion in the 

subsection. Conversely, if a clone does not have a mutation at i 
(Z,;c = 0), then it does not contribute to the count of variant 
alleles. By summing up the contributions of all clones, we obtain 
the total probability that an observed read corresponds to a 
variant allele rather than a normal allele. Therefore, the 
probability that a read contains the variant allele at locus / in 
subsection j is given by 

nij=\zrP, (2) 

where Z, is the row of Z, and P-' is the 7* column of P. 
Finally, we introduce a matrix XjVxM of random variables 
representing the observed data, where Xy is the number of reads 
exhibiting the variant allele at locus i in subsection j. This matrix 
encodes our primary assumption about the distribution of the 
data: for each i and j, we observe an independent sample of X,y 
that has a binomial distribution with two parameters Rij and TCy, 
i.e., 

X,-,~Bin(7?,,-,7ty). (3) 



The first parameter of this distribution Rfj is the (known) total 
number of reads at locus i in subsection j. The second parameter, 
Uij, is the probability of observing a variant allele; it will be 
inferred by EM. 

Given the joint distribution Pr(X,Z|6) over observed variables 
X and latent variables Z, governed by parameters 6 = (P,fi), our 
goal is to maximize the likelihood function Pr(X|(?). We do so 
using EM, exploiting three assumptions: (1) that each subsection 
contains non-zero normal contamination, i.e., Pi/>0 for all j, (2) 
independence of the M subsections from each other, and (3) 
independence of mutations from each other. TIk- first assumption 
is based on the widely accepted difficulty associated with obtaining 
perfectly pure samples of tumor cells [21,22]. The two indepen- 
dence assumptions essentially state that each locus and each 
sample is informative. These assumptions are unavoidable: in the 
presence of very high dependence, only very limited information 
about the underlying clonal composition of the tumor would be 
provided by the loci and samples. Furthermore, it is worth noting 
that these independence assumptions are made conditional on the 
parameters in the model: that is, the elements of X are 
independent conditional on Z and P. In other words, if we knew 
the true underlying parameters for the model (that is, the true 
genotypes for the clones, and the true proportion of each clone 
present in each sample), then the actual number of "tumor" reads 
that we would observe for each locus-sample pair would be 
independent. 

While the formulation of our inference problem shows some 
similarity to well-studied matrix factorization problems [23-25], 
such techniques cannot be directiy apphed here. Unlike most 
matrix factorization techniques, which assume a normal distribu- 
tion, our observations are binomially distributed. Moreover, the 
elements of the latent matrix Z are binary, and each column of P 
must sum to 1. These constraints required us to develop a 
customized inference algorithm. 

2 Log likelihood 

To frame the EM optimization, we consider the following 
complete-data log likelihood function of the model: 
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£ = logPr(X,Z|0), (4) 

which can be computed as follows (for details see Note S4 in Text 
SI): 



+ J2 (Z,;clog(ft,,) + (l -Z,,,)log(l - ft,,)), 



(5) 



where 7Cy= 2'^rP-'- 

3 Expectation maximization (EM) algorithm 

Our goal is to find the parameters 6 = {P,n) which maximize the 
likelihood. Because our model involves the hidden variable Z, we 
cannot direcdy maximize the C given in Equation 5 with respect to 
6. Instead, we use the EM algorithm to fit the model to the data 
[26]. EM is an iterative algorithm with two steps — E (for 
expectation) and M (for maximization) — in each iteration. In the 
E step, we use the current estimates of the parameters, 6°^'^, to 
compute the conditional expectation of £. In the M step, we find the 
new parameters 0°** that maximize the conditional expectation. 

Overview. In this section, we present an overview of the EM 
algorithm, followed by the specific details of the E and M steps for 
our application. 

1. Randomly initialize the parameters 9°^'^. 

2. Repeat the following until a convergence criterion is satisfied 
(such as insignificant improvement in the log likehhood; see 
Equation 12). 

(a) E Step. Evaluate the posterior Pr(Z|X,e°''') using the 
current parameter values. Because each locus is indepen- 
dent, we will compute Pr(Z,|X,,0°''') for l<i<n. This can 
be done by Bayes' theorem. 



Pr(Z,|X,-,0°''*) = 



Pr(X,|Z,-,e°''')Pr(Z,|0 
E Pr(X,|Z,- = z,OPr(Z, = z|0 



ze{0,l} 



C 



(b) M Step. Evaluate 6"^ from 

0°*^'^-^argmaxe(0|0°i<') 

e 

where Q{6\9°^'^) is the following expected log likelihood with 
respect to Z conditioned on X and 6°^^: 



(7) 



e(0|r'') = Ez|x,ecid[logPr(X,Z|e)] 

= ^Pr(Z|X,e°''*)logPr(X,Z|e). 
z 

(c) Update the parameters by 



Computation for the E step. To compute the posterior 
Pr(Z|X,0°'''), we need to compute Pr(Z,|0°''') and Pr(Xi|Z,-,0°'^) 



for the i locus (see Equation 6). The latter is equal to the product 
of binomial probabilities because the samples are assumed to be 
independent. Using Equations 2 and 3, we have 



Pr(X,|Z,-,r<') = n Pr(X,v|Z,-,0 



Also, Pr(Z,|(?°''') is the product of BemouUi probabilities. From 
Equation 1, we have that 



Pr(Z,|e°''*)= n (/x°''*)^'.<^(l-jU°f)'-^'.<^. 



Computation for the M step. To get 6 , we maximize 

Q{e\e°^^) defined in Equation 7. First, we split g(0|0°'^) into two 

terms such that one term depends only on P, and the other term 
depends only on fi. This simplifies the process of finding the 
optimal parameters. 



Q(e\ 6°'^) = Ez|x,floid [log Pr(X,Z| 0)] 

= Ez|x,floid[log(Pr(X|Z,e)Pr(Z|0))] 
= ]Ez|x,floid[logPr(X|Z,P)] 



(9) 



+ Ez|x sold [log Pr(Z|^i)], 



where we let $(P) —E^i^ gold [log Pr(X|Z,P)] and ^(m) ■= 
E^ijj gold [log Pr(Z|^i)] for simplicity of notation. Similar to 
Equation S7 in Note S4 in Text SI, we have used the fact that 
conditional on Z and P, X is independent of fi, as well as the fact 
that conditional on /.(, Z is independ(;nt of P. 

Computing fi'^™ Now that we have separated into 
two terms, we can first update P by only maximizing ®(P). We 
solve the following constrained optimization problem to get P'^^^ 
(for details see Note S5 in Text SI): 



' := argmaxO(P) 

c (10) 
such that y/.c : 0 < Pcj and V/ : 'EPcj = l- 



We solve our simplified optimization problem using a quasi- 
Newton method called BFGS-B [27,28]. The original BFGS 
algorithm uses the gradient to approximate the Hessian matrix of 
second derivatives; therefore, the algorithm is very efficient when 
the gradient is available [29,30]. BFGS-B is a variant that can 
handle simple box constraints. We compute the first derivative of 

with respect to each entry of V by the chain rule, and provide it 
to BFGS-B for faster convergence (Note S2 in Text SI). 

Computing /j"™ Recall that T(/z) =]Ez|xg.'41ogPr(Z|/<)] is 

the only part of the expected log likelihood which is a function of \l 
(see Equation 9). Therefore, we can compute /.i"™ by maximizing 
*I'(^). Because we are assuming that conditional on ^, the 
elements of Z are independent, we just need to maximize each 
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term in the foUowing sum: 



"I'W=Ez|x,floM[logPr(Z||<)] = ^ ^]E2|x,eoid[logPr(Zi,,|/i,.,)]. 

i=l c = l 



The first column of Z corresponds to the normal cells, Z,_i =0, 
hence ^u,- 1 = 0 for all i. For 1 < c < C, we need to maximize, with 
respect to yu,- ^, 

= Pr(Z,,, = 1 |X,0°i<') log Pr (Z,-,, = 1 ,) 

+ Pr (Z,.,, = O|X,0°id) log Pr (Z,,, = 0| /i,,,) ( 1 1) 

= Pr(Z,,, = l|X,e°i^)log(/i,.,) 

+ (1 -Pr(Z,,,= l|X,0)log(l -/^,,,). 

By single-variable calculus, the value of yU,- that maximizes (1 1) 
is Pr(Z,-,= l|X,e°'''). For the c* clone, the probability 
Pr(Z,- (, = 1|X,0°''') can be computed by marginalizing Z,- over all 
other clones: 



Pr(Z,-,,= l|X,0°'^) = 



E 

Z,e{ {0,1}^|Z,-,, = 1} 



Pr(Z,|X,0°'<'). 



Because the posteriors Pr(Z,|X,f)°''*) are easy to compute by 
Bayes' rule (Equation 6), /i"** can be updated as follows: 



J2 PT{Zi\x,e°''^),i<i<N,i<c<c. 

Z,e{{0,l}'^|Z,-,, = l} 



In principle, for each solution, the genotype matrix Z can be 
obtained by rounding the inferred ji. However, in practice, the 
inferred values in yU were always exacdy 0 or 1 (with observed 
differences <10-^''). 

Initializatioii and convergence. We initialize elements of 
PcxM with values independently sampled from a Uniform [0,1] 
distribution. Then we standardize each column such that the sum 
of the proportions of each clone in a subsection is 1. Similarly, we 
randomly initiaUze the matrix /UatxC with values independendy 
sampled from a Uniform [0,1] distribution. In practice, we run EM 
to convergence from multiple random initializations for n and P, 
and we choose the run that results in the highest likelihood. 

The convergence criterion is based on the change in the 
expectation of the complete-data log likelihood. Specifically, we 
stop the EM iterations if: 



E. 



lold 



[^oldj 



<a 



(12) 



where a is a small positive number. We set o; = 
experiments. Using Equation 5, we can compute lEzifli-C] each 
iteration as follows: 



m.\em- 



E [ E (log ( X ' ) ^ 

+ E (-"v l0g(l"!,c) + (1 - i«i,c)l0g(l - Hi,c)) ] 



V-X,v)l0g(l-7I,y) 



(13) 



where the sums are over locus indices 1 < / < A^, subsection 
indices l<j<M, and clone indices 1 < c < C. We used the fact 
that Z,-,, is binary and Pr(Z,_c = l|(?) = jU,-^ to derive the above 
equation. 

4 Simulation results 

To vaUdate our implementation of the EM optimization 
procedure and to understand our model's behavior, we produced 
simulated deep sequencing data and measured the extent to which 
the model successfully recovers the true clonal structure of the 
data. 

For each simulation, we began by randomly generating four 
matrices. First, we generated a simulated matrix Rnxm of total 
read counts with respect to a fixed number (A'^ = 20) of loci and a 

fixed number (Me{3, . . . ,15}) of subsections with a mean 
coverage of 1000 reads per locus. The matrix was generated 
by independendy sampling each column (corresponding to a 
single subsection) from a multinomial distribution 

Multinomial(1000jV,-^, . . . ,-^), where the parameters lOOOA' 

and — correspond to the total number of trials, and the 
N 

probability of success for each of the loci, respectively. 
Second, for any clone number Ce{c|3 <c< 5 and c< M}, we 
generated a corresponding Boolean matrix Zjy x c, in which the 
entry at row i and column c indicates whether locus ; exhibits the 
variant allele in clone c. Entries in Z were generated 
independently from a Bernoulli distribution with a probability 
of success m = OJ, with the exception of the first ("normal") 
column of Z, which contains all zeroes. Third, we generated a 
clone frequency matrix PcxM as follows: each element of P is 
independendy drawn from a Uniform [0,1] distribution, and then 
each column of P was divided by the column sum, so that the 
columns summed to 1. We then set P' =(1,0, . . . ,0) so that the 
first column of P corresponds to the normal subsection. Finally, 
for each locus / and subsection j, we generated the observed 
number of variant alleles Xij by sampling from a binomial 
distribution with parameters Rij (representing the total number 

of reads) and -Z, -P' (representing the probability that a given 

read corresponds to the variant allele). This last step complies 
with our primary assumption about the distribution of the data 
(Equation 3). 

We ran the EM algorithm using the simulated data R and X 
and then evaluated the extent to which the estimated clone 
frequency matrix P and mutation probabiUty matrix jl differed 
from the corresponding true matrices P and Z. Specifically, we 
computed the genotype error ez, defined as 

J TV c 

^^=^EElA,-.c-z,-.| 

1 = 1 c=l 

and the clone frequency error ep, 
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J C M 
c=l 7=1 

Note that, because we did not know which columns of jX 
correspond to which columns of Z, we compared Z to every 
permutation of the columns of /i and selected the permutation that 
resulted in the smallest genotype error. The selected permutation 
was then also used in the calculation of the clone frequency error. 

Our simulation results (Figure 2) exhibit two primary trends. 
The overall error rate, as measured by either genotype or clone 
frequency error, decreases systematically as the number of 
subsections increases, and increases as the number of clones 
increases. Overall, both error rates are low, especially for C = 3. 
The observed trends are expected: for a fixed number of clones, 
the availability of more subsections leads to more accurate 
estimation of the true parameter values; and for a fixed number 
of subsections, the presence of more clones leads to a greater 
number of parameters that must be inferred, leading to greater 
error in estimation. 

To assess the affect of sequencing error on the performance of 
Clomial, we added noise to the simulated data and repeated the 
above experiments. Specifically, we modeled noise by Bernoulli 
random variables with probability of success interpreted as the 
probability that a non-tumor allele is read as a tumor allele or vica 
versa. Running the EM algorithm on the noisy data revealed that 
Clomial is robust with respect to noise for all reasonable levels of 
sequencing error (Figure S6) in Text SI. 

5 Application to a primary breast cancer 

We obtained breast cancer tissue from a 44 year old 
premenopausal female with infiltrative ductal carcinoma (IDC) 
with ductal carcinoma in situ (DGIS), stage pTlc pNl, Grade II/ 
III, estrogen receptor (ER) positive, progesterone receptor (PR) 
positive and Her2 negative. Axillary lymph node dissection 
revealed that one out of 1,3 nodes was positive for metastatic 
disease. A total of 6 tissue sections were obtained, including 2 
sections from adjacent normal breast tissue, 3 from the primary 
breast cancer, and 1 from the positive lymph node. The tumor 
content, including both IDC and DCIS, ranged from 40% to 55% 
in the primary tumor and axillary lymph node tissue sections based 
on pathological examination. For subsequent analysis, each tissue 
section was subdivided into subsections (Figure 3). 

To identify mutations and rjuantify allele frequencies, we 
performed two rounds of DNA sequencing. Initially, DNA was 
extracted from each individual subsection and subjected to exome 
capture followed by Dlumina sequencing. Variants were detected 
independentiy in each subsection using the SeattleSeq Annotation 
Server. We focused on single nucleotide variants and short indels 
that exhibited a coverage of > 1 5 reads in at least one of the 
subsections, ranking them using DeepSNV [31] and Fisher's exact 
test (Methods). This analysis produced an initial set of 28 1 variants 
(Dataset SI). 

To better quantify the allele frequencies at these loci, we 
designed primer pairs surrounding each locus and used these 
primers to perform a second round of targeted DNA sequencing. 
This experiment successfully sequenced 244 of the 281 loci, with a 
mean and median coverage of 1615 and 1118, respectively, reads 
per locus. Each of these loci was individually validated by visual 
inspection using the Integrative Genomics viewer (IGV). Manual 
inspection showed that many of the initially identified mutations 
were flanked by homopolymer repeats, suggesting that the 
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alternate alleles were read calling errors, rather than true 
mutations [32]. For all downstream analysis we focused on a set 
of 17 confirmed somatic variants. For clarity of presentation, we 
refer to each somatic variant by the chromosome where it resides, 
appending a letter if more than one somatic variant occurred 
within a (iiromosomc (Table SI in Text SI). The targeted 
sequencing thus produced two 17-by-12 matrices containing, 
respectively, the total coverage and the tumor allele count at each 
locus (Table SI in Text SI). Visual inspection of the allele 
frequency profiles shows, not surprisingly, a markedly different 
pattern of allele frequencies among the subsections from primary 
and metastatic sites (Figure 3). In addition, several of the samples 
(e.g., Pl-4 and P3-1) exhibit consistently lower frequencies across 
all loci, presumably indicating a higher prevalence of normal cells 
within these samples. 

We applied our EM optimization procedure to the two counts 
matrices, varying the number of assumed clones from C = 3 up to 
C = 6. For each value of G, we ran EM 100,000 times from 
different random initializations, and we selected the solution with 
the highest likelihood (Figure 4). The resulting three-clone solution 
identifies two mutations, chr4a and chr9b, that occur in both the 
primary and metastatic samples and segregate the remaining 
mutations into nine that occurred in the primary tumor and six 
that occurred in the metastatic lymph node. The four- and five- 
clone solutions further subdivide the primary tumor mutations, 
and the six-clone solution separates the two metastatic mutations 
into distinct clones. 

To better understand the inferred clonal landscape, we 
investigated the relationship between clone frequencies and the 
anatomy of the three primary and one metastatic tumor sections. 
We hypothesized that clone frequencies should vary smoothly 
between adjacent subsections, reflecting the physical spread of 
successful clonal populations. This hypothesis is supported by the 
data (Figure 5 and Figure SI in Text SI). The trends are most 
striking in sections PI and P2, for which we obtained four separate 
subsections. In each case, the primary clone frequencies var\' in a 
monotonic fashion as we traverse the sample. Given that the EM 
infcrc-ncx- procedure was pr()\ided with no information about 
which subsection was derived from which section, nor the relative 
orientation of the subsections to one another, the smoothly varying 
frequencies among adjacent subsections provides evidence that the 
method has successfully identified true clonal variation. 

6 Tumor phylogeny 

Cancer progression is an evolutionary process in which clones 
accrue mutations over time, forming new clones. Accordingly, it 
should be possible to organize the clonal progression of a tumor 
into a phylogenetic tree- with the founder clone at the root. We 
therefore investigated whether the clones inferred by our EM 
procedure obey some simple phylogenetic constraints, with two 
complementary goals. First, because our EM procedure makes no 
use of phylogenetic constraints, this analysis can provide further 
evidence for the validity of our inferred solutions. Second, the 
phylogenetic analysis has the potential to provide significant 
insights into the clonal and mutationjil history of this specific 
cancer. 

We started with the C = 3 solution to our EM algorithm, 
manually constructing a phylogenetic tree in which each node is a 
clonal population, and edges are marked with the mutations that 
occurred in the evolution from the parent clone to the offspring 
(Figure 6A). This particular tree shows two founder mutations, 
chr4a and chr9b, occurring prior to metastasis, six mutations 
occurring along the metastatic lineage, and nine along the primary 
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Figure 2. Simulation results. The figure plots the mean (A) genotype error ez and (B) clone frequency error ep as a function of the number of 
subsections. Each mean is computed over 100 simulated data sets. For each data set, the EM optimization is repeated from 10 different random 
initializations, and the results corresponding to the largest log likelihood are reported. 
doi:10.1371/journal.pcbi.1003703.g002 



lineage. This is the only phylogenetic tree that is consistent with 
the inferred clonal genotypes. 

In contrast, for the solutions inferred from the EM algorithm 
assuming C = 4 through 6, we found that it is not possible to 
construct a tree without requiring that the same mutation occur 
independently along multiple branches. We therefore considered 
all possible "nearby" trees (where "nearby" means that, among 
the distinct rows of the genotype matrix, the two trees differ by 
only one bit) that produce a valid phylogenetic tree with no 
repeated mutations. For example, for the C = 4 solution, we 
evaluated the likelihood of six nearby trees, yielding log-likelihoods 
of -28482, -21282, -7500, -6692, -5659, and -4333 (Table 
S2 in Text SI). The highest of these likelihoods is —4333, 
compared to — 4244 for the solution initially inferred by EM. The 
selected solution requires changing only one bit in the genotype 
matrix from "0" to "1" (indicated by asterisks in Figure 4). The 
resulting phylogenetic tree (Figure 6B) closely resembles the C = 3 
tree, except that one mutation initially assigned to the metastatic 
clone C3 is instead assigned to clone C2 in the C = 4 tree. Also, the 
nine mutations associated with the primary section in the C = 3 
tree are further subdivided into three that occur shortly after 
metastasis and six that lead to clone C 1 . Reassuringly, the C = 5 
and C = 6 solutions, constructed in a similar fashion (Figure 6C- 
D), are largely consistent with this story, each introducing a 
subdivision among the existing sets of mutations to produce a 
larger set of clones. Among these trees, the only inconsistencies 
concern (1) three mutations (chr5, chr9a and chr20b) that occur 
later according to the C = 4 solution than according to the C = 5 
or C = 6 solutions and (2) two mutations (chr 1 and chr4b) that are 
assigned their own branch, directly off the normal clone, in the 
C = 5 and C = 6 solutions. In practice, the chance that a randomly 
generated genotype matrix would produce a valid phylogenetic 
tree is vanishingly small (Note S3 in Text SI). Therefore, the fact 
that each of our inferred solutions very nearly produce a valid 
phylogenetic tree provides evidence for the validity of these 
solutions. 

We also investigated the extent to which the observed mutation 
frequencies obey the phylogenetic tree. In principle, a mutation 
that occurs earlier in the evolution of the cancer should have a 
higher frequency than mutations that occur later along the same 
lineage because a child clone necessarily contains all of the 
mutations belonging to its parent clone. This investigation is 



hampered, however, by copy number variation. In practice, we 
cannot directiy compare the allele frequencies of two distal sites 
because the observed allele frequencies are actually the product of 
mutation frequency and copy number. Empirically, we observe 
variation in copy number along the genome and differences in 
copy number variation from one subsection to the next (Figure S2 
in Text SI). A consistent duplication of a large portion of 
chromosome 8 is known to occur commonly in breast cancer [33]. 
We were lucky, however, that two of our mutated loci occur quite 
close to one another on chromosome 9 (chr9a and chr9b, 
separated by only 3.3 Mbp). Given the observed data, the 
hkelUiood that a change in copy number occurring between these 
two loci is small, thereby allowing us to safely compare the 
corresponding mutation frequencies. Across all nine primary 
tumor subsections, we observe that the frequency of the parent 
mutation (chr9b) is higher than that of the child mutation (chr9a). 
Hence, these mutation frequencies are consistent with the inferred 
phylogeny. 

To assess the stability of our inference, we performed leave-one- 
out analysis and compared the inferred phylogenies as follows. We 
held out each of the 12 tumor subsections one at a time and 
trained the model using the data from only 1 1 subsections for the 
case of C = 4. When samples pl-1 or pi -3 were excluded, the 
inferred genotypes were exactly the same as the genotype obtained 
from the fuU data. Excluding any of the other of 10 subsections 
resulted in a genotype which was different only in one bit; namely, 
the mutation chr4a was predicted to be present in all clones. 
However, this difference did not affect the inferred phylogeny 
because the change of this bit was in fact required to build a valid 
phylogenetic tree (Figure 4). In other words, by excluding any of 
the 12 tumor subsections, the inferred genotype always led to the 
same valid phylogenetic tree, which suggests that our algorithm is 
stable. 

Discussion 

Once a tumor has been resected, clinicians pay a great deal of 
attention to characterizing its anatomy. Features such as necrosis, 
extension beyond normal anatomical boundaries, and microvas- 
cular invasion convey important prognostic information. In 
addition, the cancer cells within any given tumor are frequently 
heterogeneous with respect to features such as differentiation state, 
the fraction of cells undergoing mitosis (as determined by Ki67 
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Figure 3. Anatomic locations of tlie sections, and corresponding allele frequencies. The figure shows (top) the anatomic locations of the 
three primary and one metastatic sections and (bottom) the corresponding alternative allele frequencies for each subsection. The full coordinates for 
each of the 17 loci are provided in Dataset S2. 
doi:10.1371/journal.pcbi.1003703.g003 



PLOS Computational Biology | www.ploscompbiol.org 



8 



July 2014 I Volume 10 | Issue 7 | e1003703 



Inferring Clonal Composition of a Breast Cancer 



I 
I 





C- 


— o 




C— 4 






C- 










C— fi 








CI 




CI 


C2 




CI 


C2 


cs 


C4 


CI 


C2 


cs 


C4 


c 




1 

± 


1 

X 


1 

X 


1 * 

X 


1 

X 


1 

X 


1 

X 


n 
u 


1 

X 


1 

X 


1 

X 


0 
u 


1 

X 


1 

X 




1 

1 


1 

X 


1 

X 


1 

X 


1 

X 


1 

X 


1 

X 


u 


1 

X 


1 

X 


1 

X 


0 


1 

X 


1 

X 


clirl 


1 

J. 


n 


1 

X 


0 


0 


0 
u 


0 
w 


X 


n 

u 


0 


n 
u 


1 

X 


0 


0 
u 


L-lil '-± U 


1 

X 


n 


1 

X 


0 


0 


0 
u 


0 


1 

X 


0 


0 


n 
u 


1 

X 


n 

u 


0 

VJ 




1 

X 


0 
u 


1 

X 


1 

X 


0 


1 

X 


1 

X 


n 

u 


0 
u 


1 

X 


1 

X 


0 
w 


n 
u 


VJ 


r* M T" T r\ 


1 

X 


n 


1 

X 


1 

X 


0 
u 


1 

X 


1 

X 


n* 


n 

u 


1 

X 


1 

X 


0* 
w 


n 
u 


VJ 


r'riT'l / r» 


1 

X 


n 


1 

X 


1 

X 


0 


1 

X 


1 

X 


n 


n 

u 


1 

X 


1 

X 


0 


0 


0 

VJ 


n T" T 


1 

X 


n 


1 

X 


n 

u 


0 


1 

X 


1 

X 


n 

u 


u 


1 

X 


1 

X 


0 


0 


VJ 




1 

X 


n 


1 

X 


n 

u 


0 


1 

X 


1 

X 


n 
u 


0 


1 

X 


1 

X 


0 


n 


0 

VJ 


Clii z u u 


1 

X 


n 

u 


1 

X 


n 


0 


1 

X 


1 

X 


n 
u 


0 
u 


1 

X 


1 

X 


0 
w 


n 

u 


0 

VJ 




1 

X 


n 


1 

X 


n 


0 


1 

X 


0 


n 
u 


n 

u 


1 

X 


n 

u 


0 


0 


0 

VJ 




0 


1 

X 


n 

u 


1 

X 


0 


n 
u 


1 

X 


n 

u 


0 


0 


1 

X 


0 


0 


0 

VJ 


chr3c 


0 


1 


0 


0 


1 


0 


0 


0 


1 


0 


0 


0 


1 


0 


chrl7a 


0 


1 


0 


0 


1 


0 


0 


0 


1 


0 


0 


0 


1 


0 


chr20a 


0 


1 


0 


0 


1 


0 


0 


0 


1 


0 


0 


0 


1 


0 


chrSb 


0 


1 


0 


0 


1 


0 


0 


0 


1 


0 


0 


0 


0 


1 


chr6 


0 


1 


0 


0 


1 


0 


0 


0 


1 


0 


0 


0 


0 


1 



Nl-1 Pl-1 Pl-2 Pl-3 Pl-4 P2-1 P2-2 P2-3 P2-4 P3-1 P3-2 Ml-2 Ml-1 



C=3 



CO 
CI 
C2 



1.00 
0.00 
0.00 



0.47 
0.51 
0.02 



0.43 
0.55 
0.03 



0.57 
0.43 
0.00 



0.75 
0.25 
0.00 



0.77 
0.21 
0.01 



0.51 
0.48 
0.01 



0.61 
0.38 
0.00 



0.55 
0.44 
0.00 



0.44 
0.54 
0.01 



0.63 
0.36 
0.02 



0.77 
0.00 
0.22 



0.84 
0.00 
0.16 



C=4 



CO 
CI 
C2 
C3 



0.99 
0.00 
0.00 
0.00 



0.31 
0.37 
0.32 
0.00 



0.31 
0.44 
0.24 
0.00 



0.54 
0.40 
0.06 
0.00 



0.74 
0.23 
0.03 
0.00 



0.72 
0.16 
0.12 
0.00 



0.44 
0.41 
0.15 
0.00 



0.59 
0.36 
0.05 
0.00 



0.50 
0.40 
0.10 
0.00 



0.37 
0.47 
0.17 
0.00 



0.56 
0.29 
0.15 
0.00 



0.75 
0.00 
0.00 
0.25 



0.82 
0.00 
0.00 
0.18 



C=5 



CO 
CI 
C2 
C3 
C4 



0.99 
0.00 
0.00 
0.00 
0.00 



0.10 
0.07 
0.48 
0.35 
0.00 



0.07 
0.42 
0.16 
0.34 
0.00 



0.29 
0.44 
0.00 
0.27 
0.00 



0.60 
0.26 
0.00 
0.14 
0.00 



0.61 
0.02 
0.22 
0.15 
0.00 



0.22 
0.42 
0.07 
0.29 
0.00 



0.36 
0.37 
0.02 
0.25 
0.00 



0.27 
0.44 
0.02 
0.28 
0.00 



0.11 
0.48 
0.08 
0.33 
0.00 



0.40 
0.28 
0.10 
0.22 
0.00 



0.74 
0.00 
0.00 
0.00 
0.25 



0.82 
0.00 
0.00 
0.00 
0.18 



C=6 



CO 
CI 

C2 
C3 
C4 
C5 



0.99 
0.00 
0.00 
0.00 
0.00 
0.00 



0.10 
0.07 
0.48 
0.35 
0.00 
0.00 



0.07 
0.42 
0.16 
0.34 
0.00 
0.00 



0.29 
0.44 
0.00 
0.27 
0.00 
0.00 



0.60 
0.26 
0.00 
0.14 
0.00 
0.00 



0.61 
0.02 
0.22 
0.15 
0.00 
0.00 



0.22 
0.42 
0.07 
0.29 
0.00 
0.00 



0.36 
0.37 
0.02 
0.25 
0.00 
0.00 



0.27 
0.44 
0.02 
0.28 
0.00 
0.00 



0.11 
0.48 
0.08 
0.33 
0.00 
0.00 



0.40 
0.28 
0.10 
0.22 
0.00 
0.00 



0.52 
0.00 
0.00 
0.00 
0.12 
0.36 



0.66 
0.00 
0.00 
0.00 
0.10 
0.24 



Figure 4. Inferred clonal genotypes and frequencies. The top table lists, for each of the 17 loci, the inferred clonal genotypes using the EM 
procedure, assuming C = 3, 4, 5 and 6. In each case, the normal clone (CO) is omitted from the inferred matrix Z, because its genotype consists entirely 
of zeroes by construction. For reference, each distinct genotype pattern per locus is assigned a unique color according to the scheme from Figure 6. 
In the table, bits with asterisks were flipped based on the phylogenetic analysis. The corresponding inferred clonal frequencies are listed in the 
bottom table, where each block shows a matrix P for a value of C, and CO denotes the normal clone. 
doi:10.1371/journal.pcbi.1003703.g004 



staining), or (for breast cancer) tlie fraction of cells expressing 
HER-2 or estrogen receptor. The method described here provides 
a framework for linking a tumor's molecular anatomy to its 
structural anatomy as well as its phylogenetic evolution. 

Several lines of evidence support the vahdity of the clonal 
genotypes and relative frequencies inferred by our model. One 
prediction from our phylogenetic reconstruction is that somatic 



variants at the trunk will be present at higher frequencies 
throughout all tumor subsections than variants appearing at the 
branches. While copy number variation across the somatic 
genome complicates these comparisons, one of two closely 
juxtaposed somatic variants (chr9b) is positioned at the trunk of 
our phylogenetic tree, while its neighbor (chr9a) arises in one of 
the branches. Consistent with this representation, the variant allele 
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Figure 5. Clone frequencies vary smoothly across adjacent subsections. The panels display the pattern of inferred clone frequencies across 
subsections (A) PI, (B) P2, (C) P3 and (D) Ml. Each bar plot shows the relative frequencies of tumor clones in the corresponding subsection after 
accounting for normal contamination. Clones are numbered as in Figure 4, and the normal clone, CO, is not shown. This figure shows the C = 4 
solution; Figure SI in Text SI shows the C = 5 and C = 6 solutions. 
doi:1 0.1 371/journal.pcbi.1 003703.g005 



frequencies for chr9b are consistently higher than for chr9a in all 
ten tumor subsections examined. 

Interestingly, phylogenies can be built from the inferred 
genotypes even given the relatively low purity of the tumor 
sections: contamination with normal tissue was > 50% in 9 out of 
12 subsections in our data (Figure 4, C = 3). In particular, 
although we estimate that the metastatic subsections contained 
<20% tumor cells ui Ml-1 and <30% in Ml -2, the correspond- 
ing branch of the phylogenies is stable and consistent. 

Similar to phylogenetic analysis, reassembly of the tumor 
subsections indicates that our assignment of mutations to clones 
produces spatial representations that are anatomically reasonable. 
With further refinements, our method should enable reconstruc- 
tions that layer a tumor's phylogeny on top of its spatial 
organization. 

While our results underscore the potential power of this new 
method, our study also has several limitations. Our assessments 
were confined to heterozygous somatic variants, and did not take 
into account the many chromosomal structural changes that were 



present in the tumor we examined. A comparison of exome copy 
numbers between primary tumor and lymph node indicates that 
the vast majority of these chromosomal changes preceded the 
divergence shown in our phylogenetic tree (Figure S2 in Text SI). 
In theory, one could imagine generalizing our generative model to 
take copy number variations into account by replacing the 2 in the 
denominator of Equation 2 with a hidden random variable for 
each locus, but without some form of aggressive regularization, 
this formulation would lead to a prohibitively complex and overfit 
model. 

Additionally, a key characteristic of our method is the 
requirement to specify the number of clones C prior to the EM 
inference procedure. It is important to recognize that this choice 
should depend upon properties of the data set itself, rather than 
fundamental properties of the cancer. After all, each cell division 
results in multiple mutations, such that every cancer cell 
constitutes a distinct clone. Consequendy, a picture of the fuU 
clonal history of a cancer would consist of a phylogenetic tree with 
one leaf for each cancer cell. In practice, such a tree would be of 
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Figure 6. Cancer phytogenies. Each panel shows the inferred clonal phylogeny assuming (A) C = 3, (B) C = 4, (C) C = 5 and (D) C = 6 clones, where 
CO corresponds to the normal clone. Nodes correspond to inferred clonal populations, and edges are annotated with mutations that occur between 
the parent and child clones. Two mutations are grouped into a colored box if they both occur on the same branch in all four phylogenies. 
doi:1 0.1 371 /journal.pcbi.1 003703.g006 



limited utility and, more importantly, could not be accurately 
estimated from any reasonably sized data set. Perhaps the most 
useful definition of a tumor clone is a population of cells that 
exhibit distinct spatial or functional properties. Our approach 
allows the user to specily the number of clones and, hence, the 
resolution at which the clonal history is viewed. 

Because Clomial does not impose any assumption on the 
distribution of mutation frequencies, the number of inferred clones 
may not exceed the number of samples; otherwise, the resulting 
optimization problem will be under-constrained. 

In the particular cancer studied here, the three-clone solution 
appears to provide an inaccurate view of the clonal history. The 
placement of the chrl7c mutation along the path leading to 
metastatic clone C2 is surprising, given that this particular locus 
has such low counts for both metastatic subsections (2 counts for 
subsection Ml-1 and 0 counts for Ml-2, Table SI in Text SI). 
This apparent anomaly can be explained by the small counts 
associated with chrl7c in four out of the 10 primary tumor 
subsections (3 counts in PI -3, 4 in PI -4, and 21 in each of P2-1 
and P2-2). Faced with the choice of what genotype profile to assign 
to this particular locus, the inference procedure selected a solution 



in which only two subsections, rather than four, are inconsistent. 
However, given the flexibility of a 4-clone model, the anomaly is 
resolved, and chrl7c defines a novel clone C2 that occurs in the 
primary tumor samples and is completely absent from the 
metastatic samples. 

In practice, it may be possible to estimate how many clones the 
data set can resolve using a method such as the Bayesian 
Information Criterion (BIG), with a smaller BIG value indicating a 
better fit to the data [34—36]. This approach has been used 
previously for estimating tumor clonal composition [37,38]. BIG 
analysis of our model on simulated data suggests that, on average, 
the BIG accurately estimates the true number of clones, even in 
the presence of sequencing noise (Figure S3A-B in Text SI). 

We also computed the BIG for models trained on our real breast 
cancer data (Figure S3G in Text SI) and observed a large decrease 
in BIG (45%) when C increases from 3 to 4, suggesting that the 
C = 3 model is too simple to describe the data. However, the 
subsequent improvements of the BIG are smaller: 29%, 20%, 9%, 
and 3% respectively, as C grows from 4 to 8. In general, one 
should avoid increasing the complexity of the model when the BIG 
improvement is small because, in such situations, adding to the 
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number of free parameters of the model can potentially lead to 
over-fitting [39-47]. Note that, as an alternative to a BIG 
approach, one could instead take an approach motivated by 
cross-validation, as has been explored in the context of matrix 
factorization models [48-50]. 

Running the EM algorithm is very fast. In practice, using a 
2.40 GHz processor with 2 GB memory, training a single EM 
instance on the real data set takes a few seconds up to several 
minutes, depending on the value of the hyperparameter C (Figure 
S4 in Text SI). However, because the optimization problem in the 
M step is non-('()u\x'x, many EM instances must be trained from 
different random initializations to avoid local optima. 

We first noted that Clomial achieved good results on simulated 
data using only 10 random initiahzations when C = 3 (Figure 2). 
Then, to further assess the appropriate number of EM instances to 
run, we revisited the solutions from all of our 100,000 EM 
instances, counting how many instances are required to achieve 
the best observed model (Figure S5 in Text SI). In practice, while 
1000 EM instances is sulficient to find the optimum solution when 
C = 2 or 3, a larger number of random initializations is required as 
the number of clones grows. This is an expected phenomenon 
because the complexity of the model grows significantly with C, 
resulting in an optimization surface with many more local optima. 
Consequentiy, despite the highly parallel nature of the computa- 
tion, scaling up to analysis of larger data set with larger numbers of 
clones win likely require improved EM training strategies, such as 
noise injection or regularization. 

Final])', although we used a simple phylogenetic tree construc- 
tion procedure to evaluate the quality of our inferred clonal 
genotypes, the EM inference procedure described here does not 
explicitiy model tumor evolution. Ultimately, we aim to produce a 
model that automatically infers not only clonal genotypes and 
clonal frequencies, but also the number of clones and the 
phylogenetic tree relating them. 

Our method differs significantly from other approaches. A 
recent characterization of 21 breast cancers defined clones by 
clustering mutations with similar variant allele frequencies [9]. 
The success of this strategy hinges on characterizing the 
frequencies of large numbers (hundreds or thousands) of somatic 
variants. In contrast, our method can reconstruct clonal phylog- 
enies based on accurately measuring alleles of much smaller 
numbers of somatic variants. The view afforded by our method 
may provide novel insights into tumor biology. In particular, 
results from Nik-Zainal and colleagues [9] were interpreted to 
indicate that cancers become clinically appar(-nt only after one of 
the competing clones has achieved clonal dominance. In contrast 
to this "winner takes all" hypothesis, our model suggests that some 
cancers might be more accurately regarded as ecosystems, in 
which clones may be subject to spatial influences that affect their 
competitive fitness, or may even collaborate to support tumor 
growth. 

An important difference between our method and many otluT 
methods based on clustering [8,9,12] is our explicit probabilistic 
modeling of the random selection of normal and variant alleles 
during sequencing, according to a binomial distribution. By taking 
into account not just the relative frequency of the two alleles but 
the separate counts of normal and variant alleles, our model 
automatically assigns less importance to a locus with lower 
coverage, even if the locus yields the same variant allele frequency 
as a high-coverage locus. 

While this manuscript was under review, two methods called 
PyClone [1,3] and PhyloSub [51] were published, which do model 
allele counts using a binomial distribution. These methods attempt 
to simultaneously infer not only clonal genotypes and frequencies. 



as Clomial does, but also infer the number of clones and their 
phylogeny. Furthermore, PyClone and PhyloSub are not limited, 
as Clomial is, to situations in which the number of inferred clones 
is less than or equal to the number of available samples. How is 
this possible? To make these inferences feasible, these clustering 
methods must make certain distributional assumptions about the 
data. Specifically, PyClone assumes a Dirichlet Process prior for 
clone frequencies, where the base distribution is Uniform [0,1] and 
the concentration parameter is Gamma distributed with shape and 
scale parameters equal to 1 and 0.001, respectively. PhyloSub 
extends PyClone by using a tree-structured stick-breaking process 
[52] to directiy account for phylogenetic relationships during the 
inference. In principle, these assumptions enable PyClone and 
PhyloSub to infer information about a large number of clones 
from only a single sample. On the other hand, when multiple 
samples are available, Clomial can draw accurate inferences 
without requiring these distributional assumptions. In practice, our 
comparison showed that Clomial and PhyloSub produce similar 
results on three previously described chronic lymphocytic leuke- 
mia (CLE) cases [53] (Tables S3-S5 in Text SI). 

We note that if e, is the sequencing error rate at locus /, then the 
probability of observing a variant allele at this locus in subsection j 
is estimated by 7Cy(l — e,) + (l — 7iy)e,. In principle, sequencing 
noise could be incorporated into our model by replacing Tty, 
defined in Equation 2, with — 2e,)-|-e,- in the likelihood and 
EM algorithm. However, given the robustness of the current 
method to noise (Figures S6 and S3C in Text SI), we opted to 
keep our model simple. In future applications, it may be beneficial 
to model noise in data produced by sequencing technologies that 
exhibit high error rates (>0.02%) such as PacBio RS [54]. 

The EM algorithm is not the only option for maximizing the 
log-likelihood for the observed data. In particular, one could 
instead treat both Z, and P as optimization variables and seek to 
maximize £(X|Z,P) with respect to Z and P. This would amount 
to iteratively updating Z and then updating P until convergence, 
similar to the iterative algorithms typically used for matrix 
factorization models [23-25,50]. However, this alternative ap- 
proach would not have any computational advantage in terms of 
the update for P, which would still not have a closed-form 
solution, and would need to be solved using BFGS-B or an 
equivalent approach. Furthermore, the update for Z would be 
very complicated under the constraint that Z is a binary matrix. 
Therefore, we developed a customized inference algorithm based 
on EM. 

Whereas genetic testing for cancer patients today focuses on 
mutations affecting a relatively small number of cancer-associated 
genes, most cancers are sustained by networks of aberrantiy 
regulated genes that collaborate to promote tumor growth. The 
ability to assign mutations to clones, and to layer a tumor's clonal 
content on top of its structural anatomy in space and over time, 
can provide new insights into the mechanisms that enable cancers 
to invade, metastasize and escape treatment. 

Materials and Methods 

Ethics statement 

This research was reviewed and approved by the Cancer 
Consortium Institutional Review Board (IRB) located at the Fred 
Hutchinson Cancer Research Center (FHCRC). The FHCRC has 
an approved Federalwide Assurance on file with the Office for 
Human Research Protections (number 00001920). The Federal- 
wide Assurance is a formal written, binding commitment that 
assures that the FHCRC promises to comply with the regulations 
and ethical guidelines governing research with human subjects, as 
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stipulated by the U.S. Department of Health and Human Services 
under 45 CFR 46. Because this study involved the use of de- 
identified specimens obtained from an IRB-approved repository, 
we did not interface with patients. Patient consent was adminis- 
tered, in compliance with 45 CFR 46, by investigators who 
maintain the repository. Patients gave their consent for their 
specimens to be stored in the repository and subsequently used for 
research in cancer. The FHCRC IRB deemed that our research 
was in concordance with the purpose of the registry and the 
patient informed consent. 

Breast cancer tissue sample 

We obtained breast cancer tissues from the Breast Cancer 
Biospecimen Repository of Fred Hutchinson Cancer Research 
Center after IRB approval. The patient was a 44 year old pre- 
menopausal woman diagnosed with infiltrative ductal carcinoma 
(IDC) and ductal carcinoma in situ (DCIS), stage pTlc pNl, 
Grade II/III, ER positive, PR positive and Her2 negative. AxUlary 
lymph node dissection revealed that one out of 13 nodes was 
positive for metastatic disease. A total of 5 pieces were obtained 
from surgical samples including 1 tissue section from adjacent 
normal breast tissue (Nl), 3 tissue sections from the primary breast 
cancer (PI, P2, P3), and 1 tissue section from the positive axillary 
lymph node (Ml). Each section is about 1 cm by 1 cm by 0.5 cm. 
The tumor content, including both IDC and DCIS, ranges from 
40% to 55% in the primary tumor and axillary lymph node tissue 
sections based on pathological examination (PI 55% IDC, P2 45% 
IDC, P3 40% IDC and 15% DCIS, Ml 50% IDC). 

Tissue DNA extraction 

Each individual section was subdivided into multiple subsec- 
tions, and the anatomic locations of all the subsections were 
recorded (Figure 3). Using Qiagen AUPrep DNA/RNA Micro Kit, 
DNA was extracted from one normal subsection (Nl-1), seven 
primary subsections (Pl-2, Pl-3, Pl-5, P2-1, P2-3, P3-3, P3-4) and 
one metastatic subsection (Ml-1). After quantification, all the 
DNA samples were subjected to exome capture followed by 
lUumina sequencing. 

Whole exome sequencing 

Next generation sequencing was carried out at the Northwest 
Genome Center at University of Washington on the normal 
subsection, seven primary subsections, and one metastatic 
subsection. For each subsection, one microgram of genomic 
DNA was used to construct the random-shearing library per 
standard protocol with Covaris acoustic sonication. Libraries then 
underwent exome capture using the ~36.5 Mb target from 
Roche/Nimblegen SeqCap EZ v2.0 (~ 80,000 exons and flanking 
sequence). Since each library was uniquely barcoded, samples 
were performed in multiplex. Massively parallel sequencing was 
carried out on the HiSeq sequencer. 

Read processing 

Sequence reads were processed with a pipeline consisting of the 
following elements: (1) base calls generated in real-time on the 
HiSeq instrument (RTA 1.12.4.2); (2) Perl scripts developed in- 
house to produce demultiplexed fastq files by lane and index 
sequence; (3) demultiplexed BAM files aligned to a human 
reference (hgl9) using BWA (Burrows-Wheeler Aligner; vO.5.9) 
[55]. Read-pairs not mapping within +2 standard deviations of 
the average library size {~ 125+ 15 bp for exomes) are removed. 
AH aligned read data were subjected to the following steps: (1) 
"duphcate removal" was performed, (i.e., the removal of reads 



with duphcate start positions; Picard MarkDuplicates; vl.l4); (2) 
indel realignment was performed (GATK IndelRealigner; vl.O- 
6 1 25) resulting in improved base placement and lower false variant 
calls; (3) base qualities were recalibrated (GATK TableRecalibra- 
tion; vl. 0-6 125). All sequence data then underwent a previously 
described quality control protocol [56] . 

Variant detection 

Variant detection and genotyping were performed using the 
UnifiedGenotyper tool from GATK (v 1.0-6 125). Variant data for 
each sample were formatted (variant call format) as "raw" calls 
that contain individual genotype data for one or multiple samples, 
and flagged using the filtration walker (GATK) to mark sites that 
are of lower quality/false positives, e.g., low quality scores (<50), 
allelic imbalance (>0.75), long homopolymer runs (>3) and/or 
low quality by depth (QD < 5). 

Calling single nucleotide variants (SNVs) and indels 

Most of the commonly used software for calling SNVs and 
indels, including SNVMix [57] and VarScan [58], requires tumor 
content > 80%. To allow identification of low frequency alleles 
that occur in only one or a few subsections, we did not pool all of 
the data together. Instead, we designed a method that is 
appropriate for multiple samples from one patient, with relatively 
low tumor content, ranging from 45% to 55%. At each 
chromosomal position (locus), we considered six mutually exclusive 
possible outcomes: A, C, G, T, deletion, and unknown. The counts 
of these six outcomes at each locus between normal and each of 
the multiple tumor subsections were compared with a 2x6 
Fisher's exact test. To correct for multiple testing, we used the 
qvalue R package to convert p — values to q — values. Only those 
chromosomal loci with ^<0.1 in at least one comparison between 
normal and tumor samples were accepted for downstream 
analysis. This analysis identified 6310 loci. 

For each accepted locus, we used a heuristic procedure to identify 
which of the six alleles differed between the tumor and normal 
sample. For each subsection, we carried out six 2 x 2 Fisher's exact 
tests, one for each of the six possible alleles. Thus, each such test 
compared one allele's counts to the sum of the counts for the other 
five alleles. Using a p-value threshold of 0.01, an allele was declared 
to be increased, decreased, or unchanged in the tumor subsection as 
compared to the normal sample. The changes that were classified as 
"increased" and had a normal count of zero were called tumor- 
specific mutations. This procedure identified a total of 268 such 
tumor-specific mutations, with a mean and median serjuencing 
depth of 92 and 75, respectively. Corresponding annotations were 
obtained from SeattleSeq (http://snp.gs.washington.edu/ 
SeattieSeqAnnotationl 37). 

In parallel, we also analyzed our data using deepSNV [31] by 
comparing the normal subsection to the 8 tumor subsections. We 
ran deepSNV on the loci with total co\'erage across all samples 
more than 50, which resulted in th(- identification of 29 loci with 
q<0.l. The union of the two lists yielded 281 loci for further 
validation (Dataset SI). 

Targeted deep sequencing 

Mutations were \'alidated by targeted deep sequencing of DNA 
derived from one normal subsection (Nl-1), 10 primary subsec- 
tions (P 1-1, Pl-2, Pl-3, PI -4, P2-1, P2-2, P2-3, P2-4, P3-1, P3-2) 
and two metastatic subsections (Ml-1 and Ml-2). The subsections 
were selected to have low normal content and to span the tumor 
anatomy. Genomic DNA was prepared as described for the initial 
exome sequencing. A HaloPlex probe capture library for selective 
capture of 281 target loci was generated with SureDesign (Agilent 
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Technologies). Target enrichment for deep sequencing was 
carried out with the HaloPlexTM Target Enrichment System 
from Agilent Technologies following the manufacturer's 
protocol. Triplicate enrichments were performed for each sample. 
Target-enriched samples were sequenced using a MiSeq (lUu- 
mina). Of the 281 target loci, 244 were successfully sequenced with 
coverage more than 100 reads for the normal sample. The mean, 
median, and the standard deviation of the coverage were 1615, 
1118, and 1600, respectively (Dataset S2). 

All 244 loci were visualized using the Integrative Genomics 
Viewer [59,60]. A set of 17 loci were selected based upon three 
criteria: (1) at least 3 reads cover the locus in the normal sample, 
(2) the variant allele is not present in the normal tissue (allowing for 
a few variant counts, which may reflect sequencing error) and (3) 
there are no nearby clustered mutations, indicative of sequencing 
or mapping error. Independently, the data were also analyzed 
using deepSNV. Applying a q — value threshold of ^<10^^ 
yielded 1 9 loci, including all 1 7 of the initially selected loci. The 1 7 
loci were retained for downstream analysis (Table SI in Text SI). 

Bayesian Information Criterion analysis 

We computed BIG using the following formula: 

BTC= -2C*+{NC+MC-M)\og{\\R\\^), (14) 

where C* is the expectation of the complete-data log likelihood, 
which is maximized in the last M step (see Equations 7 and 13). 
Also, {NC + MC — M) represents the total number of free 
parameters, and = ^Rij is the total number of counts. 

Supporting Information 

Dataset SI Dataset SI includes, for each of the 281 targeted 
loci, the following information: (1) the chromosomal coordinates of 
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